This lab draws from materials developed by Rebecca Ferrell, Max Schneider and Jess Kunke for CSSS 592 (Longitudinal Data Analysis).
Here are all the libraries we’ll use today. You can go ahead and install them (if you don’t have them) and load them.
# if you need to install, e.g., knitr: install.packages("knitr")
library(knitr) # for making tables
library(dplyr) # for the pipe operator %>% and the group_by and summarize functions
library(tidyr) # for the spread function
library(ggplot2) # for plotting
Today we’ll work with a new data set, milk.csv, from a study in which cows were given different diets and the amount of protein in their milk was measured each week. Yesterday we worked with data from an R package, but we didn’t load data from a file, so here’s one of the many ways to do that:
milk <- read.table("milk.csv", header=TRUE, sep=",")
Reviewing some of the programming/R terms we used yesterday, here we’ve provided three arguments to the function read.table, two of which are named:
sep is the separator that delineates one column from another; in a csv (comma-separated value) file, values are separated by commas, so we used sep=",". Other common separators include tabs, a single space, or a semicolon.header=TRUE argument in the read.table function to tell R to read that first line as variable names instead of as data.
header=FALSE instead.)header argument? (Hint: try ?read.table.)So now let’s review some skills from yesterday as part of exploring this dataset.
# you can type your code here for 1-3 above!
Here are some other questions we might want to ask or things we might want to do, and we’ll talk today about how to address them using R:
This data already had a header, and the names are pretty intuitive. But sometimes one of those is not true for the dataset you’re working with, so here is how to rename your variables. First, let’s pretend we don’t have the header by skipping the first row and setting header=FALSE:
milk <- read.table("milk.csv", header=FALSE, skip=1, sep=",")
# check the variable names:
head(milk)
## V1 V2 V3 V4
## 1 Barley 1 1 3.63
## 2 Barley 1 2 3.57
## 3 Barley 1 3 3.47
## 4 Barley 1 4 3.65
## 5 Barley 1 5 3.89
## 6 Barley 1 6 3.73
colnames(milk)
## [1] "V1" "V2" "V3" "V4"
# change the variable names
# note: both colnames() and names() here do the same thing
colnames(milk) <- c("Diet", "Cow", "Week", "Protein")
colnames(milk)
## [1] "Diet" "Cow" "Week" "Protein"
Suppose we want to see how many observations we have for each diet- is it fairly balanced? Let’s see a way to do this using the dplyr library.
dplyr is an R package for manipulating data frames. It makes use of a special operator called a “pipe” (%>%) which means “take the object on the left and do the command on the right to it”. dplyr speeds up common operations, especially in data processing on particular variables in a data set. For example, with piping, you don’t need to keep retyping the name of a data frame when referencing its columns. This package also contains several useful commands for common data processing tasks.
# first let's see how this pipe works, using the example of the filter function
result1 <- filter(milk, Diet=="Barley")
# take a look at result1 here first, then run the next line
result2 <- milk %>% filter(Diet=="Barley")
sum(result1 != result2) # = 0, so the two results are identical
## [1] 0
# can also do identical(result1, result2)
Note that if your line ends with an operator like + or %>%, or with a comma, then R will expect your line to continue onto the next line; this is nice for splitting long or multi-step commands into multiple lines as you’ll see in the next (and other) examples.
So now let’s accomplish our objective here:
n_obs_by_diet <- milk %>%
group_by(Diet) %>%
summarize(n=n())
group_by() function groups the observations by the variable or variables you pass to it. This is like select() in that it doesn’t alter the data frame, just determines how the data are passed to the next function.summarize() function creates a new data frame with the results of evaluating the functions you pass as arguments to summarize() for each of the groups you formed previously using group_by(). (The documentation at ?summarize() will show you several of the functions you can pass to it.)summarize() is n(), which computes the number of elements in each of the groups.Since we group by diet and then summarize using the count function n(), the end result is a data frame whose rows are the diets, and each row of which contains the name of the diet and the number of observations for that diet (this includes all weeks, all cows).
We can display this in a table using the knitr package (or many other packages). Note: this table is formatted differently in PDF vs HTML knitted documents.
kable(n_obs_by_diet, caption="Number of observations by diet")
| Diet | n |
|---|---|
| Barley | 425 |
| Lupins | 453 |
| Mixed | 459 |
avg_protein <- milk %>%
group_by(Diet, Week) %>%
summarize(avg_protein_value=mean(Protein))
## `summarise()` has grouped output by 'Diet'. You can override using the `.groups` argument.
kable(avg_protein, caption="Average protein level by diet and by week")
| Diet | Week | avg_protein_value |
|---|---|---|
| Barley | 1 | 3.886800 |
| Barley | 2 | 3.642500 |
| Barley | 3 | 3.498000 |
| Barley | 4 | 3.376400 |
| Barley | 5 | 3.484400 |
| Barley | 6 | 3.386400 |
| Barley | 7 | 3.468800 |
| Barley | 8 | 3.503200 |
| Barley | 9 | 3.511739 |
| Barley | 10 | 3.518800 |
| Barley | 11 | 3.455000 |
| Barley | 12 | 3.429200 |
| Barley | 13 | 3.512400 |
| Barley | 14 | 3.506800 |
| Barley | 15 | 3.541579 |
| Barley | 16 | 3.601765 |
| Barley | 17 | 3.682000 |
| Barley | 18 | 3.640667 |
| Barley | 19 | 3.640000 |
| Lupins | 1 | 3.758148 |
| Lupins | 2 | 3.427778 |
| Lupins | 3 | 3.372963 |
| Lupins | 4 | 3.294074 |
| Lupins | 5 | 3.238077 |
| Lupins | 6 | 3.280370 |
| Lupins | 7 | 3.187200 |
| Lupins | 8 | 3.310000 |
| Lupins | 9 | 3.347037 |
| Lupins | 10 | 3.268846 |
| Lupins | 11 | 3.232593 |
| Lupins | 12 | 3.213704 |
| Lupins | 13 | 3.333704 |
| Lupins | 14 | 3.254074 |
| Lupins | 15 | 3.263500 |
| Lupins | 16 | 3.265000 |
| Lupins | 17 | 3.252000 |
| Lupins | 18 | 3.302000 |
| Lupins | 19 | 3.205714 |
| Mixed | 1 | 3.861111 |
| Mixed | 2 | 3.540000 |
| Mixed | 3 | 3.345556 |
| Mixed | 4 | 3.277778 |
| Mixed | 5 | 3.337778 |
| Mixed | 6 | 3.392593 |
| Mixed | 7 | 3.333333 |
| Mixed | 8 | 3.397692 |
| Mixed | 9 | 3.435185 |
| Mixed | 10 | 3.437037 |
| Mixed | 11 | 3.354815 |
| Mixed | 12 | 3.374074 |
| Mixed | 13 | 3.410769 |
| Mixed | 14 | 3.372222 |
| Mixed | 15 | 3.446000 |
| Mixed | 16 | 3.570588 |
| Mixed | 17 | 3.511250 |
| Mixed | 18 | 3.450625 |
| Mixed | 19 | 3.395714 |
Great! But maybe a better table format would be to have a 19 x 3 table, where each row is a week, each column is a diet, and the values in the table are the average protein values for that week-diet combination. For this, we want to turn this long-format table into a wide-format table. We’ll use the spread function from the tidyr package. Note that I also display fewer digits here.
avg_protein_wide <- avg_protein %>%
# spread takes these arguments:
# key=the variable to use for the names of the new columns in wide format,
# value=the current variable/column whose values will be the values in those new columns
spread(key=Diet, value=avg_protein_value)
kable(avg_protein_wide, digits=2, caption="Average protein level by diet and by week")
| Week | Barley | Lupins | Mixed |
|---|---|---|---|
| 1 | 3.89 | 3.76 | 3.86 |
| 2 | 3.64 | 3.43 | 3.54 |
| 3 | 3.50 | 3.37 | 3.35 |
| 4 | 3.38 | 3.29 | 3.28 |
| 5 | 3.48 | 3.24 | 3.34 |
| 6 | 3.39 | 3.28 | 3.39 |
| 7 | 3.47 | 3.19 | 3.33 |
| 8 | 3.50 | 3.31 | 3.40 |
| 9 | 3.51 | 3.35 | 3.44 |
| 10 | 3.52 | 3.27 | 3.44 |
| 11 | 3.46 | 3.23 | 3.35 |
| 12 | 3.43 | 3.21 | 3.37 |
| 13 | 3.51 | 3.33 | 3.41 |
| 14 | 3.51 | 3.25 | 3.37 |
| 15 | 3.54 | 3.26 | 3.45 |
| 16 | 3.60 | 3.27 | 3.57 |
| 17 | 3.68 | 3.25 | 3.51 |
| 18 | 3.64 | 3.30 | 3.45 |
| 19 | 3.64 | 3.21 | 3.40 |
Alternatively, the developers of tidyr have also created a replacement for spread called pivot_wider. You may find the pivot_wider function easier to understand.
avg_protein_wide_2 <- avg_protein %>%
# spread takes these arguments:
# names_from=the variable to use for the names of the new columns in wide format,
# values_from=the current variable/column whose values will be the values in those new columns
pivot_wider(names_from=Diet, values_from=avg_protein_value)
Let’s plot the protein levels (y-axis) by week (x-axis) for each cow (one line per cow), and we can color code the lines based on the cow’s diet.
library(ggplot2)
ggplot(data=milk, aes(x=Week, y=Protein, group=Cow, color=Diet)) +
geom_point() +
geom_line() +
ggtitle("Protein levels by week for each cow")
Let’s pick this apart a bit. Notice that if you just plot the first line with the ggplot() command, it doesn’t plot anything! It just sets up the plotting area.
ggplot(data=milk, aes(x=Week, y=Protein, group=Cow, color=Diet))
So let’s store this as a plotting object called milk_graph and then add things to it one by one to see what’s going on.
# Store the graph so far
milk_graph <- ggplot(data=milk, aes(x=Week, y=Protein, group=Cow, color=Diet))
# View what it looks like so far- should look the same as the blank plotting region above
milk_graph
# Let's add geom_point() - what does it do?
milk_graph +
geom_point()
# What does geom_line() do?
milk_graph +
geom_point() +
geom_line()
# What does ggtitle() do?
milk_graph +
geom_point() +
geom_line() +
ggtitle("Protein levels by week for each cow")
In this example, we are using the following:
aes):
x: \(x\)-axis, our time variabley: \(y\)-axis, our outcome variablegroup: our unit variablecolor: variable we want to use to segment by color+), each of which can take arguments:
ggplot: base layer containing information about the data and overall aestheticsgeom_point: scatterplot points at \(x\) and \(y\) valuesgeom_line: line connecting values within each groupggtitle: title for the plot. xlab and ylab work similarly for labeling the axes if our variable names are not descriptive enough.Additional aesthetics we can use are shape (symbols used for points), linetype, size (dot size/line thickness), alpha (transparency level, 0 for fully transparent and 1 for fully opaque), and fill (color for areas like bar charts).
We can change the general appearance of parts of the graph that aren’t related to specific variables (e.g. use large sized symbols for the geom_point layer). These are passed as non-aesthetic options to the layers; that is, in the specific layer and not enclosed in an aes() group in the ggplot() main layer.
For more information on modifying colors, legends, etc., see http://www.cookbook-r.com/Graphs/index.html. For specific R color names, see http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf. Shapes and linetypes can be modified in a similar way, with more information at http://www.cookbook-r.com/Graphs/Shapes_and_line_types.
ggplot2 gives us tons of options. For example, we can get rid of the points (no geom_point), get rid of the gray background (add theme_bw layer), and change the colors used (set scale_color_manual values). This is probably the best (cleanest and most useful) of the three plots so far:
# storing the graph to an object -- I can add more layers later!
milk_graph <- ggplot(data=milk, aes(x=Week, y=Protein, group=Cow, color=Diet)) +
geom_line(alpha=0.5) +
ggtitle("Protein levels by week for each cow") +
theme_bw() +
scale_color_manual(values=c("mediumorchid4", "darkgoldenrod1", "deepskyblue"))
milk_graph